METHOD AND APPARATUS FOR PARTITIONING A RESOURCE
BETWEEN MULTIPLE THREADS WITHIN A MULTI-THREADED
PROCESSOR

FIELD OF THE INVENTION 

The present invention relates generally to the field of multi-threaded 
processors and, more specifically, to a method and apparatus for 
partitioning a processor resource within a multi-threaded processor. 


BACKGROUND OF THE INVENTION 

Multi-threaded processor design has recently been considered an
increasingly attractive option for increasing the performance of processors.
Multithreading within a processor, inter alia, provides the potential for more
effective utilization of various processor resources, and particularly for more
effective utilization of the execution logic within a processor. Specifically, by
feeding multiple threads to the execution logic of a processor, clock cycles
that would otherwise have been idle due to a stall or other delay in the
processing of a particular thread may be utilized to service another thread.
A stall in the processing of a particular thread may result from a number of
occurrences within a processor pipeline. For example, a cache miss or a
branch misprediction (i.e., a long-latency operation) for an instruction
included within a thread typically results in the processing of the relevant
thread stalling. The negative effect of long-latency operations on execution
logic efficiencies is exacerbated by the recent increases in execution logic
throughput that have outstripped advances in memory access and retrieval
rates.

Multi-threaded computer applications are also becoming increasingly
common in view of the support provided to such multi-threaded
applications by a number of popular operating systems, such as the
Windows NT® and Unix operating systems. Multi-threaded computer
applications are particularly efficient in the multi-media arena.

Multi-threaded processors may broadly be classified into two
categories (i.e., fine or coarse designs) according to the thread interleaving or
switching scheme employed within the relevant processor. Fine
multi-threaded designs support multiple active threads within a processor and
typically interleave two different threads on a cycle-by-cycle basis. Coarse
multi-threaded designs typically interleave the instructions of different
threads on the occurrence of some long-latency event, such as a cache miss.
A coarse multi-threaded design is discussed in Eickemeyer, R.; Johnson, R.;
et al., "Evaluation of Multithreaded Uniprocessors for Commercial
Application Environments", The 23rd Annual International Symposium on
Computer Architecture, pp. 203-212, May 1996. The distinctions between
fine and coarse designs are further discussed in Laudon, J.; Gupta, A.,
"Architectural and Implementation Tradeoffs in the Design of
Multiple-Context Processors", Multithreaded Computer Architectures: A
Summary of the State of the Art, edited by R.A. Iannucci et al., pp. 167-200,
Kluwer Academic Publishers, Norwell, Massachusetts, 1994. Laudon further
proposes an interleaving scheme that combines the cycle-by-cycle switching
of a fine design with the full pipeline interlocks of a coarse design (or
blocked scheme). To this end, Laudon proposes a "back off" instruction that
makes a specific thread (or context) unavailable for a specific number of
cycles. Such a "back off" instruction may be issued upon the occurrence of
predetermined events, such as a cache miss. In this way, Laudon avoids
having to perform an actual thread switch by simply making one of the
threads unavailable.

Where resource sharing is implemented within a multi-threaded
processor (i.e., there is limited or no duplication of function units for each
thread supported by the processor) it is desirable to effectively share
resources between the threads.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not
limitation, in the figures of the accompanying drawings, in which like
references indicate similar elements and in which:


Figure 1 is a block diagram illustrating an exemplary pipeline of a 
processor within which the present invention may be implemented. 

Figure 2 is a block diagram illustrating an exemplary embodiment of
a processor, in the form of a general-purpose multi-threaded
microprocessor, within which the present invention may be
implemented.

Figure 3 is a block diagram illustrating selected components of an
exemplary multi-threaded microprocessor, and specifically depicts
various functional units that provide a buffering (or storage)
capability as being logically partitioned to accommodate multiple
threads.

Figure 4 is a block diagram showing further details regarding various
components of an exemplary trace delivery engine (TDE).



Figure 5 is a block diagram illustrating further architectural details of
an exemplary trace cache fill buffer.

Figure 6 is a block diagram illustrating further architectural details of
an exemplary trace cache (TC).

Figure 7 is a block diagram illustrating further structural details of an
exemplary trace cache (TC).

Figure 8 is a block diagram illustrating various inputs and outputs of 
exemplary thread selection logic. 

Figure 9 is a block diagram illustrating three exemplary components 
of exemplary thread selection logic in the form of a thread selection 
state machine, and a counter and comparator for a second thread. 

Figure 10 is a state diagram illustrating exemplary operation of an 
exemplary thread selection state machine. 

Figure 11 is a block diagram illustrating architectural details of an 
exemplary embodiment of victim selection logic. 

Figure 12 is a flow chart illustrating an exemplary method of
partitioning a memory resource, such as for example a trace cache,
within a multi-threaded processor.

Figure 13 is a flow chart illustrating an exemplary method of
partitioning a resource, in the exemplary form of a memory resource,
utilizing a Least Recently Used (LRU) history associated with the
relevant memory resource.

Figure 14 is a block diagram illustrating an exemplary LRU history
data structure.

Figure 15 is a block diagram illustrating further details pertaining to
inputs to, and outputs from, exemplary victim selection logic.

DETAILED DESCRIPTION

A method and apparatus for partitioning a processor resource within
a multi-threaded processor are described. In the following description, for
purposes of explanation, numerous specific details are set forth in order to
provide a thorough understanding of the present invention. It will be
evident, however, to one skilled in the art that the present invention may be
practiced without these specific details.

For the purposes of the present specification, the term "event" shall be
taken to include any event, internal or external to a processor, that causes a
change or interruption to the servicing of an instruction stream (macro- or
micro-instruction) within a processor. Accordingly, the term "event" shall be
taken to include, but not be limited to, branch instructions, exceptions and
interrupts that may be generated within or outside the processor.

For the purposes of the present specification, the term "processor"
shall be taken to refer to any machine that is capable of executing a sequence
of instructions (e.g., macro- or micro-instructions), and shall be taken to
include, but not be limited to, general purpose microprocessors, special
purpose microprocessors, graphics controllers, audio controllers,
multi-media controllers and microcontrollers. Further, the term "processor"
shall be taken to refer to, inter alia, Complex Instruction Set Computers (CISC),
Reduced Instruction Set Computers (RISC), or Very Long Instruction Word
(VLIW) processors.

For the purposes of the present specification, the term "resource" shall
be taken to include any unit, component or module of a processor, and shall
be taken to include, but not be limited to, a memory resource, a processing
resource, a buffering resource, a communications resource or bus, a
sequencing resource or a translating resource.

Processor Pipeline 
Figure 1 is a high-level block diagram illustrating an exemplary
embodiment of a processor pipeline 10 within which the present invention
may be implemented. The pipeline 10 includes a number of pipe stages,
commencing with a fetch pipe stage 12 at which instructions (e.g.,
macroinstructions) are retrieved and fed into the pipeline 10. For example, a
macroinstruction may be retrieved from a cache memory that is integral
with the processor, or closely associated therewith, or may be retrieved from
an external main memory via a processor bus. From the fetch pipe stage 12,
the macroinstructions are propagated to a decode pipe stage 14, where
macroinstructions are translated into microinstructions (also termed
"microcode") suitable for execution within the processor. The
microinstructions are then propagated downstream to an allocate pipe stage
16, where processor resources are allocated to the various microinstructions
according to availability and need. The microinstructions are then executed
at an execute stage 18 before being retired, or "written-back" (e.g., committed
to an architectural state) at a retire pipe stage 20.

Microprocessor Architecture
Figure 2 is a block diagram illustrating an exemplary embodiment of
a processor 30, in the form of a general-purpose microprocessor, within
which the present invention may be implemented. The processor 30 is
described below as being a multi-threaded (MT) processor, and is
accordingly able simultaneously to process multiple instruction threads (or
contexts). However, a number of the teachings provided below in the
specification are not specific to a multi-threaded processor, and may find
application in a single threaded processor. In an exemplary embodiment,
the processor 30 may comprise an Intel Architecture (IA) microprocessor
that is capable of executing the Intel Architecture instruction set. An
example of such an Intel Architecture microprocessor is the Pentium Pro®
microprocessor or the Pentium III® microprocessor manufactured by Intel
Corporation of Santa Clara, California.

The processor 30 comprises an in-order front end and an out-of-order
back end. The in-order front end includes a bus interface unit 32, which
functions as the conduit between the processor 30 and other components
(e.g., main memory) of a computer system within which the processor 30
may be employed. To this end, the bus interface unit 32 couples the
processor 30 to a processor bus (not shown) via which data and control
information may be received at and propagated from the processor 30. The
bus interface unit 32 includes Front Side Bus (FSB) logic 34 that controls
communications over the processor bus. The bus interface unit 32 further
includes a bus queue 36 that provides a buffering function with respect to
communications over the processor bus. The bus interface unit 32 is shown
to receive bus requests 38 from, and to send snoops or bus returns 40 to, a
memory execution unit 42 that provides a local memory capability within
the processor 30. The memory execution unit 42 includes a unified data and
instruction cache 44, a data Translation Lookaside Buffer (TLB) 46, and a
memory ordering buffer 48. The memory execution unit 42 receives
instruction fetch requests 50 from, and delivers raw instructions 52 (i.e.,
coded macroinstructions) to, a microinstruction translation engine 54 that
translates the received macroinstructions into a corresponding set of
microinstructions.

The microinstruction translation engine 54 effectively operates as a
trace cache "miss handler" in that it operates to deliver microinstructions to a
trace cache 62 in the event of a trace cache miss. To this end, the
microinstruction translation engine 54 functions to provide the fetch and
decode pipe stages 12 and 14 in the event of a trace cache miss. The
microinstruction translation engine 54 is shown to include a next instruction
pointer (NIP) 100, an instruction Translation Lookaside Buffer (TLB) 102, a
branch predictor 104, an instruction streaming buffer 106, an instruction
pre-decoder 108, instruction steering logic 110, an instruction decoder 112,
and a branch address calculator 114. The next instruction pointer 100, TLB 102,
branch predictor 104 and instruction streaming buffer 106 together
constitute a branch prediction unit (BPU) 99. The instruction decoder 112
and branch address calculator 114 together comprise an instruction translate
(IX) unit 113.

The next instruction pointer 100 issues next instruction requests to
the unified cache 44. In the exemplary embodiment where the processor 30
comprises a multi-threaded microprocessor capable of processing two
threads, the next instruction pointer 100 may include a multiplexer (MUX)
(not shown) that selects between instruction pointers associated with either
the first or second thread for inclusion within the next instruction request
issued therefrom. In one embodiment, the next instruction pointer 100 will
interleave next instruction requests for the first and second threads on a
cycle-by-cycle ("ping pong") basis, assuming instructions for both threads
have been requested, and instruction streaming buffer 106 resources for both
of the threads have not been exhausted. The next instruction pointer
requests may be for either 16, 32 or 64 bytes depending on whether the
initial request address is in the upper half of a 32-byte or 64-byte aligned
line. The next instruction pointer 100 may be redirected by the branch
predictor 104, the branch address calculator 114 or by the trace cache 62,
with a trace cache miss request being the highest priority redirection
request.
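
By way of illustration only, the cycle-by-cycle ("ping pong") interleaving described above may be modeled in software as follows. This is a minimal sketch and not a description of the actual circuit; the function names select_thread and interleave, and the per-thread flags requested and isb_available, are assumptions introduced for this example.

```python
def select_thread(last_thread, requested, isb_available):
    """Pick the thread whose next instruction request issues this cycle.

    requested[t]     : True if instructions for thread t have been requested
    isb_available[t] : True if streaming-buffer resources remain for thread t
    Returns 0 or 1, or None when neither thread is eligible.
    """
    eligible = [t for t in (0, 1) if requested[t] and isb_available[t]]
    if not eligible:
        return None
    other = 1 - last_thread
    # Ping-pong: prefer the thread that did not issue in the previous cycle.
    return other if other in eligible else eligible[0]


def interleave(cycles, requested, isb_available, last_thread=1):
    """Return the sequence of threads selected over a number of cycles."""
    order = []
    for _ in range(cycles):
        t = select_thread(last_thread, requested, isb_available)
        order.append(t)
        if t is not None:
            last_thread = t
    return order
```

When both threads are eligible the model alternates between them; when only one thread is eligible, that thread issues on every cycle.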

When the next instruction pointer 100 makes an instruction request to
the unified cache 44, it generates a two-bit "request identifier" that is
associated with the instruction request and functions as a "tag" for the
relevant instruction request. When returning data responsive to an
instruction request, the unified cache 44 returns the following tags or
identifiers together with the data:

1. The "request identifier" supplied by the next instruction
   pointer 100;

2. A three-bit "chunk identifier" that identifies the chunk
   returned; and

3. A "thread identifier" that identifies the thread to which the
   returned data belongs.
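
The three identifiers listed above may, purely for illustration, be pictured as fields packed into a single tag word. The two-bit and three-bit widths are taken from the description; the one-bit thread identifier (sufficient for two threads), the field ordering, and the names ReturnTag, pack_tag and unpack_tag are assumptions made for this sketch.

```python
from collections import namedtuple

ReturnTag = namedtuple("ReturnTag", ["request_id", "chunk_id", "thread_id"])

REQUEST_BITS = 2  # width stated in the description
CHUNK_BITS = 3    # width stated in the description
THREAD_BITS = 1   # assumption: one bit distinguishes two threads


def pack_tag(tag):
    """Pack the identifiers into one integer (field order is an assumption)."""
    if not 0 <= tag.request_id < (1 << REQUEST_BITS):
        raise ValueError("request_id out of range")
    if not 0 <= tag.chunk_id < (1 << CHUNK_BITS):
        raise ValueError("chunk_id out of range")
    if not 0 <= tag.thread_id < (1 << THREAD_BITS):
        raise ValueError("thread_id out of range")
    return ((tag.thread_id << (REQUEST_BITS + CHUNK_BITS))
            | (tag.chunk_id << REQUEST_BITS)
            | tag.request_id)


def unpack_tag(word):
    """Recover the three identifiers from a packed tag word."""
    request_id = word & ((1 << REQUEST_BITS) - 1)
    chunk_id = (word >> REQUEST_BITS) & ((1 << CHUNK_BITS) - 1)
    thread_id = (word >> (REQUEST_BITS + CHUNK_BITS)) & ((1 << THREAD_BITS) - 1)
    return ReturnTag(request_id, chunk_id, thread_id)
```
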

Next instruction requests are propagated from the next instruction
pointer 100 to the instruction TLB 102, which performs an address lookup
operation, and delivers a physical address to the unified cache 44. The
unified cache 44 delivers a corresponding macroinstruction to the
instruction streaming buffer 106. Each next instruction request is also
propagated directly from the next instruction pointer 100 to the instruction
streaming buffer 106 so as to allow the instruction streaming buffer 106 to
identify the thread to which a macroinstruction received from the unified
cache 44 belongs. The macroinstructions from both first and second threads
are then issued from the instruction streaming buffer 106 to the instruction
pre-decoder 108, which performs a number of length calculation and byte
marking operations with respect to a received instruction stream (of
macroinstructions). Specifically, the instruction pre-decoder 108 generates a
series of byte marking vectors that serve, inter alia, to demarcate
macroinstructions within the instruction stream propagated to the
instruction steering logic 110.

The instruction steering logic 110 then utilizes the byte marking 
vectors to steer discrete macroinstructions to the instruction decoder 112 for 
the purposes of decoding. Macroinstructions are also propagated from the 
instruction steering logic 110 to the branch address calculator 114 for the 
purposes of branch address calculation. Microinstructions are then 
delivered from the instruction decoder 112 to the trace delivery engine 60. 

During decoding, flow markers are associated with each 
microinstruction. A flow marker indicates a characteristic of the associated 
microinstruction and may, for example, indicate the associated 
microinstruction as being the first or last microinstruction in a microcode 
sequence representing a macroinstruction. The flow markers include
"beginning of macroinstruction" (BOM) and "end of macroinstruction"
(EOM) flow markers. According to the present invention, the decoder 112
may further decode the microinstructions to have shared resource 
(multiprocessor) (SHRMP) flow markers and synchronization (SYNC) flow 
markers associated therewith. Specifically, a shared resource flow marker 
identifies a microinstruction as a location within a particular thread at which 
the thread may be interrupted (e.g., re-started or paused) with less negative 
consequences than elsewhere in the thread. The decoder 112, in an 
exemplary embodiment of the present invention, is constructed to mark 
microinstructions that comprise the end or the beginning of a parent 
macroinstruction with a shared resource flow marker. A synchronization
flow marker identifies a microinstruction as a location within a particular
thread at which the thread may be synchronized with another thread 
responsive to, for example, a synchronization instruction within the other 
thread. 
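
The marking of microinstructions with flow markers may be sketched, for illustration only, as shown below. The sketch assumes that the first and last microinstructions of each macroinstruction receive the BOM and EOM markers respectively, and that each of these also receives a shared resource (SHRMP) marker, consistent with the exemplary embodiment described above. The names FlowMarker and mark_uops, and the representation of a microinstruction as a string, are assumptions made for this example.

```python
from enum import Flag, auto


class FlowMarker(Flag):
    BOM = auto()    # beginning of macroinstruction
    EOM = auto()    # end of macroinstruction
    SHRMP = auto()  # shared resource marker: a low-cost interruption point
    SYNC = auto()   # synchronization marker (not assigned in this sketch)


def mark_uops(uops):
    """Attach flow markers to the microinstructions of one macroinstruction.

    uops is the ordered list of microinstructions decoded from a single
    macroinstruction. Returns a list of (uop, markers) pairs.
    """
    marked = []
    last = len(uops) - 1
    for i, uop in enumerate(uops):
        markers = FlowMarker(0)  # no markers by default
        if i == 0:
            markers |= FlowMarker.BOM | FlowMarker.SHRMP
        if i == last:
            markers |= FlowMarker.EOM | FlowMarker.SHRMP
        marked.append((uop, markers))
    return marked
```

A macroinstruction that decodes to a single microinstruction carries the BOM, EOM and SHRMP markers on that one microinstruction.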

From the microinstruction translation engine 54, decoded instructions
(i.e., microinstructions) are sent to a trace delivery engine 60. The trace
delivery engine 60 includes the trace cache 62, a trace branch predictor (BTB) 
64, a microcode sequencer 66 and a microcode (uop) queue 68. The trace 
delivery engine 60 functions as a microinstruction cache, and is the primary 
source of microinstructions for a downstream execution unit 70. By 
providing a microinstruction caching function within the processor pipeline, 
the trace delivery engine 60, and specifically the trace cache 62, allows 
translation work done by the microinstruction translation engine 54 to be 
leveraged to provide an increased microinstruction bandwidth. In one
exemplary embodiment, the trace cache 62 may comprise a 256-set, 8-way
set-associative memory. The term "trace", in the present exemplary embodiment,
may refer to a sequence of microinstructions stored within entries of the
trace cache 62, each entry including pointers to preceding and succeeding
microinstructions comprising the trace. In this way, the trace cache 62
facilitates high-performance sequencing in that the address of the next entry 
to be accessed for the purposes of obtaining a subsequent microinstruction is 
known before a current access is complete. Traces may be viewed as
"blocks" of instructions that are distinguished from one another by trace
heads, and are terminated upon encountering an indirect branch or by
reaching one of many preset threshold conditions, such as the number of
conditional branches that may be accommodated in a single trace or the
maximum number of total microinstructions that may comprise a trace.
The trace cache branch prediction unit 64 provides local branch predictions
pertaining to traces within the trace cache 62. The trace cache 62 and the
microcode sequencer 66 provide microinstructions to the microcode queue
68, from where the microinstructions are then fed to an out-of-order
execution cluster. The microcode sequencer 66 furthermore includes a
number of event handlers 67, embodied in microcode, that implement a number
of operations within the processor 30 in response to the occurrence of an
event such as an exception or an interrupt. The event handlers 67 are
invoked by an event detector (not shown) included within a register
renamer 74 in the back end of the processor 30.

The processor 30 may be viewed as having an in-order front-end,
comprising the bus interface unit 32, the memory execution unit 42, the
microinstruction translation engine 54 and the trace delivery engine 60, and
an out-of-order back-end that will be described in detail below.

Microinstructions dispatched from the microcode queue 68 are
received into an out-of-order cluster 71 comprising a scheduler 72, the
register renamer 74, an allocator 76, a reorder buffer 78 and a replay queue
80. The scheduler 72 includes a set of reservation stations, and operates to
schedule and dispatch microinstructions for execution by the execution unit
70. The register renamer 74 performs a register renaming function with
respect to hidden integer and floating-point registers (that may be utilized in
place of any of the eight general purpose registers or any of the eight
floating-point registers, where a processor 30 executes the Intel Architecture
instruction set). The allocator 76 operates to allocate resources of the
execution unit 70 and the cluster 71 to microinstructions according to
availability and need. In the event that insufficient resources are available to
process a microinstruction, the allocator 76 is responsible for asserting a stall
signal 82, that is propagated through the trace delivery engine 60 to the
microinstruction translation engine 54, as shown at 58. Microinstructions,
which have had their source fields adjusted by the register renamer 74, are
placed in the reorder buffer 78 in strict program order. When
microinstructions within the reorder buffer 78 have completed execution
and are ready for retirement, they are then removed from the reorder buffer
78. The replay queue 80 propagates microinstructions that are to be
replayed to the execution unit 70.

The execution unit 70 is shown to include a floating-point execution
engine 84, an integer execution engine 86, and a level 0 data cache 88. In one
exemplary embodiment in which the processor 30 executes the Intel
Architecture instruction set, the floating-point execution engine 84 may
further execute MMX® instructions.

Multithreading Implementation
In the exemplary embodiment of the processor 30 illustrated in
Figure 2, there may be limited duplication or replication of resources to
support a multithreading capability, and it is accordingly necessary to
implement some degree of resource sharing between threads. The resource
sharing scheme employed, it will be appreciated, is dependent upon the
number of threads that the processor is able simultaneously to process. As
functional units within a processor typically provide some buffering (or
storage) functionality and propagation functionality, the issue of resource
sharing may be viewed as comprising (1) storage and (2)
processing/propagating bandwidth sharing components. For example, in a
processor that supports the simultaneous processing of two threads, buffer
resources within various functional units may be statically or logically
partitioned between two threads. Similarly, the bandwidth provided by a
path for the propagation of information between two functional units must
be divided and allocated between the two threads. As these resource
sharing issues may arise at a number of locations within a processor
pipeline, different resource sharing schemes may be employed at these
various locations in accordance with the dictates and characteristics of the
specific location. It will be appreciated that different resource sharing
schemes may be suited to different locations in view of varying
functionalities and operating characteristics.

Figure 3 is a block diagram illustrating selected components of the
processor 30 illustrated in Figure 2, and depicts various functional units that
provide a buffering capability as being logically partitioned to accommodate
two threads (i.e., thread 0 and thread 1). The logical partitioning for two
threads of the buffering (or storage) and processing facilities of a functional
unit may be achieved by allocating a first predetermined set of entries
within a buffering resource to a first thread and allocating a second
predetermined set of entries within the buffering resource to a second
thread. Specifically, this may be achieved by providing two pairs of read
and write pointers, a first pair of read and write pointers being associated
with a first thread and a second pair of read and write pointers being
associated with a second thread. The first set of read and write pointers may
be limited to a first predetermined number of entries within a buffering
resource, while the second set of read and write pointers may be limited to a
second predetermined number of entries within the same buffering resource.
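
The use of two pairs of read and write pointers, each confined to a predetermined set of entries, may be illustrated by the following minimal software model. It is a sketch under stated assumptions rather than a description of any particular functional unit: the class name PartitionedBuffer, the circular (wrap-around) pointer behavior, and the per-thread occupancy counters are all introduced for this example.

```python
class PartitionedBuffer:
    """One storage array logically partitioned between thread 0 and thread 1."""

    def __init__(self, entries_t0, entries_t1):
        self.storage = [None] * (entries_t0 + entries_t1)
        # (base index, size) of the entries reserved for each thread
        self.bounds = [(0, entries_t0), (entries_t0, entries_t1)]
        # One read/write pointer pair per thread, kept as offsets that
        # never leave that thread's partition.
        self.read_ptr = [0, 0]
        self.write_ptr = [0, 0]
        self.count = [0, 0]

    def write(self, thread, value):
        base, size = self.bounds[thread]
        if self.count[thread] == size:
            return False  # this partition is full; the other is unaffected
        self.storage[base + self.write_ptr[thread]] = value
        self.write_ptr[thread] = (self.write_ptr[thread] + 1) % size
        self.count[thread] += 1
        return True

    def read(self, thread):
        base, size = self.bounds[thread]
        if self.count[thread] == 0:
            return None  # nothing buffered for this thread
        value = self.storage[base + self.read_ptr[thread]]
        self.read_ptr[thread] = (self.read_ptr[thread] + 1) % size
        self.count[thread] -= 1
        return value
```

Because each pointer pair wraps within its own partition, a thread that fills its entries cannot overwrite entries reserved for the other thread.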

In the exemplary embodiment, the instruction streaming buffer 106, the trace
cache 62, and an instruction queue 103 are shown to each provide a storage
capacity that is logically partitioned between the first and second threads.
Each of these units is also shown to include a "shared" capacity that may,
according to respective embodiments, be dynamically allocated to either the
first or the second thread according to certain criteria.



Trace Delivery Engine 
One embodiment of the present invention is described below as being
implemented within a trace delivery engine 60, and specifically with respect
to a trace cache 62. However, it will be appreciated that the present
invention may be applied to partition any resources within or associated
with a processor, and the trace delivery engine 60 is merely provided as an
exemplary embodiment.

As alluded to above, the trace delivery engine 60 may function as a
primary source of microinstructions during periods of high performance by
providing relatively low latency and high bandwidth. Specifically, for a
CISC instruction set, such as the Intel Architecture x86 instruction set,
decoding of macroinstructions to deliver microinstructions may introduce a
performance bottleneck as the variable length of such instructions
complicates parallel decoding operations. The trace delivery engine 60
attempts to address this problem to a certain extent by providing for the
caching of microinstructions, thus obviating the need for microinstructions
executed by the execution unit 70 to be continually decoded.

To provide high-performance sequencing of cached
microinstructions, the trace delivery engine 60 creates sequences of entries
(or microinstructions) that may conveniently be labeled "traces". A trace
may, in one embodiment, facilitate sequencing in that the address of a
subsequent entry can be known during a current access operation, and
before a current access operation is complete. In one embodiment, a trace of
microinstructions may only be entered through a so-called "head" entry, that
includes a linear address that determines a set of subsequent entries of the
trace, which are stored in successive sets, with every entry (except a tail entry)
containing a way pointer to a next entry. Similarly, every entry (except a
head entry) contains a way pointer to a previous entry.
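
The linkage of trace entries through way pointers may be illustrated by the following minimal sketch, which walks a trace from its head entry to its tail entry. The sketch assumes the 256-set, 8-way organization of the exemplary embodiment; the dictionary layout of an entry, the function name walk_trace, and the wrap-around from the last set to the first are assumptions made for this example.

```python
NUM_SETS = 256  # exemplary embodiment
NUM_WAYS = 8    # exemplary embodiment


def walk_trace(cache, head_set, head_way):
    """Collect the microinstructions of one trace, starting at its head.

    cache[s][w] is a dict with keys 'uop', 'next_way' and 'prev_way'.
    'next_way' is None in the tail entry; 'prev_way' is None in the head.
    Successive entries of a trace occupy successive sets.
    """
    uops = []
    s, w = head_set, head_way
    while w is not None:
        entry = cache[s][w]
        uops.append(entry["uop"])
        w = entry["next_way"]    # way of the next entry, known in advance
        s = (s + 1) % NUM_SETS   # wrap-around is an assumption of this sketch
    return uops


def make_empty_cache():
    """Build an empty set-by-way array for use with walk_trace."""
    return [[None] * NUM_WAYS for _ in range(NUM_SETS)]
```

Since each entry already names the way of its successor and the successor resides in the next set, the location of the next entry is known before the current access completes.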

In one embodiment, the trace delivery engine 60 may implement two 
modes to either provide input thereto or output therefrom. The trace 
delivery engine 60 may implement a "build mode" when a miss occurs with 
respect to a trace cache 62, such a miss being passed on to the 
microinstruction translation engine 54. In the "build mode", the 
microinstruction translation engine 54 will then perform a translation 
operation on a macroinstruction received either from the unified cache 44, or 
by performing a memory access operation via the processor bus. The 
microinstruction translation engine 54 then provides the microinstructions, 
derived from the macroinstruction(s), to the trace delivery engine 60 which 
populates the trace cache 62 with these microinstructions. 

When a trace cache hit occurs, the trace delivery engine 60 operates in 
a "stream mode" where a trace, or traces, of microinstructions are fed from 
the trace delivery engine 60, and specifically the trace cache 62, to the 
processor back end via the microinstruction queue 68. 

Figure 4 is a block diagram showing further details regarding the
various components of the trace delivery engine (TDE) 60 shown in Figure 2.
The next instruction pointer 100, which forms part of the microinstruction
translation engine 54, is shown to receive a prediction output 65 from the
trace branch prediction unit 64. The next instruction pointer 100 provides an
instruction pointer output 67, which may correspond to the prediction
output 65, to the trace cache 62.

A trace branch address calculator (TBAC) 120 monitors the output of 
the microsequencer microinstruction queue 68, and performs a number of 
functions to provide output to a trace branch information table 122. 
Specifically, the trace branch address calculator 120 is responsible for bogus
branch detection, for the validation of branch target and branch prediction
operations, for computing a Next Linear Instruction Pointer (NLIP) for each
instruction, and for detecting limit violations for each instruction.

The trace branch information table (TBIT) 122 stores information 
required to update the trace branch prediction unit 64. The table 122 also 
holds information for events and, in one embodiment, is hard partitioned to 
support multithreading. Of course, in an alternative embodiment, the table 
122 may be dynamically partitioned. 

The trace branch information table 122 provides input to a trace 
branch target buffer (trace BTB) 124 that operates to predict "leave trace" 
conditions and "end-of-trace" branches. To this end, the buffer 124 may 
operate to invalidate microinstructions. 

When operating in the above-mentioned "build mode", 
microinstructions are received into the trace cache 62 via a trace cache fill 
buffer (TCFB) 125, which is shown in Figure 4 to provide input into the trace 
cache 62. 

Figure 5 is a block diagram illustrating further architectural details of
the trace cache fill buffer 125. In one embodiment, the buffer 125 includes
first and second buffers 134 and 136, each of which is dedicated to a specific
thread (e.g., thread 0 and thread 1). Each of the buffers 134 and 136 provides
four (4) entries for an associated thread, and outputs microinstructions to a
staging buffer 138, from where the microinstructions are communicated to
the trace cache 62. The trace cache fill buffer 125 implements a build
algorithm in hardware that realizes the "build mode", and provides
microinstruction positioning, and the detection of "end-of-line" and
"end-of-trace" conditions.

The trace cache 62 is shown in Figure 4 to include a data array 128 
and an associated tag array 126. The data array 128 provides storage for, 
in one embodiment, 12 KB of microinstructions. 

Figure 6 is a block diagram illustrating further architectural details 
pertinent to the trace cache 62. The thread selection logic 140 implements a 
thread selection state machine that, in one embodiment, decides on a 
cycle-by-cycle basis which of multiple threads (e.g., thread 0 or thread 1) is 
propagated to subsequent pipe stages of a processor 30. 

Figure 6 also illustrates the partitioning of the trace cache 62 into 
three portions (or sections), namely a first portion 148 dedicated to a first 
thread, a second portion 152 dedicated to a second thread, and a third 
portion 150 that is dynamically shared between the first and second threads. 
In the exemplary embodiment, each of the first and second portions 148 and 
152 comprises two (2) ways of the data array 128 (and the associated tag 
array 126) of the trace cache 62. The third, shared portion 150 constitutes 
four (4) ways of the data array 128, and of the associated tag array 126. The 
illustrated partitioning of the trace cache 62 is implemented by victim 
selection logic 154, which will be described in further detail below. 

Figure 7 is a block diagram illustrating an exemplary structure of the 
trace cache 62, according to one embodiment. Each of the tag array 126 and 
the data array 128 is shown to comprise an eight-way, set associative 
arrangement including 256 sets, thus providing a total of 2048 entries within 
each of the tag and data arrays 126 and 128. Each entry 148 within the tag 
array 126 is shown to store, inter alia, tag field information 151, a thread bit 
153, a valid bit 155 and a Least Recently Used (LRU) bit 240 for each 
corresponding entry 156 within the data array 128. The thread bit 153 marks the 
data within the associated entry 156 as belonging, for example, to either a 
first or a second thread. The valid bit 155 marks the data within the 
corresponding entry 156 of the data array 128 as being valid or invalid. 
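
The geometry described above can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the names `TagEntry`, `total_entries` and `tag_array` are hypothetical and do not appear in the described embodiment, and the LRU information is omitted for brevity.

```python
from dataclasses import dataclass

NUM_WAYS = 8     # eight-way set associative arrangement
NUM_SETS = 256   # number of sets in each array


@dataclass
class TagEntry:
    """Hypothetical model of one tag array entry (field names are assumptions)."""
    tag: int = 0          # tag field information
    thread: int = 0       # thread bit: 0 = first thread, 1 = second thread
    valid: bool = False   # valid bit for the corresponding data array entry


def total_entries(num_sets: int = NUM_SETS, num_ways: int = NUM_WAYS) -> int:
    """Number of entries in each of the tag and data arrays."""
    return num_sets * num_ways


# One tag entry per (set, way) pair; all entries start out invalid.
tag_array = [[TagEntry() for _ in range(NUM_WAYS)] for _ in range(NUM_SETS)]
```

With 256 sets of eight ways each, `total_entries()` reproduces the figure of 2048 entries given above.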

One embodiment of the trace cache 62 may also include a further 
minitag array 127, as illustrated in Figure 7, that is a subset of the full tag 
array 126 and that is utilized to perform high-speed tag match operations 
and to reduce power consumption related to performing a lookup with 
respect to the trace cache 62. A hit on the minitag array 127 may be 
regarded as "mutually exclusive", as will be described in further detail 
below. 

Thread Selection Logic 

Dealing first with the thread selection logic 140, which determines, 
inter alia, the output of the trace cache 62, Figure 8 is a block diagram 
illustrating the various inputs and outputs of the thread selection logic 140. 
The thread selection logic 140 is shown to take inputs from (1) a trace cache 
build engine 139, located in the microinstruction translation engine interface, 
(2) the microinstruction queue 68 and (3) trace cache/microsequencer 
control logic 137. Utilizing these inputs, the thread selection logic 140 
attempts to generate an advantageous thread selection (e.g., thread 0 or 
thread 1) for a particular cycle. Thread selection, in one embodiment, is 
performed on a cycle-by-cycle basis and attempts to optimize performance 
while not starving either thread of processor resources. 

The output of the thread selection logic 140 is shown to be 
communicated to the microcode sequencer 66, the trace branch prediction 
unit 60 and the trace cache 62 to affect thread selection within each of these 
units. 

Figure 9 is a block diagram illustrating three components of the 
thread selection logic 140, namely a thread selection state machine 160, a 
counter and comparator 162 for a first thread (e.g., thread 0) and a further 
counter and comparator 164 for a second thread (e.g., thread 1). 

The thread selection state machine 160 is shown to receive build and 
mode signals 161, indicating whether the processor is operating in a 
multithreaded (MT) or a single-threaded (ST) mode and, if operating in a 
multithreaded mode, indicating whether or not each thread is in a build 
mode. The thread selection state machine 160 is also shown to receive 
respective full inputs 172 and 174 from the counter and comparator units 162 
and 164. The full signals 172 and 174 indicate whether a threshold number 
of microinstructions for a particular thread are within the trace delivery 
engine 60. In one embodiment, each of the units 162 and 164 allows a total of 
4 x 6 microinstruction lines within the trace delivery engine 60. The full 
signals 172 and 174 are routed to all the units within the trace delivery 
engine 60, responsive to which such units are responsible for recycling their 
states. Each of the counter and comparator units 162 and 164 is shown to 
receive a queue deallocation signal 166 from the microcode sequencer 66, a 
collection of clear, nuke, reset and store signals 168, and valid bits 170 from 
the trace cache tag array 126. 
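
The role of a counter and comparator unit can be sketched in Python as follows. This is a minimal software model, not the described circuit: the class and method names are hypothetical, and reading "4 x 6" as a threshold of 24 microinstructions is an assumption made for illustration.

```python
class CounterComparator:
    """Hypothetical per-thread counter and comparator (names are assumptions)."""

    def __init__(self, threshold: int = 4 * 6):
        # Assumption: "4 x 6" is treated as a threshold of 24 microinstructions.
        self.threshold = threshold
        self.count = 0

    def allocate(self, n: int = 1) -> None:
        """Count microinstructions entering the trace delivery engine."""
        self.count += n

    def deallocate(self, n: int = 1) -> None:
        """Model a queue deallocation signal; the count never drops below zero."""
        self.count = max(0, self.count - n)

    def clear(self) -> None:
        """Model a clear, nuke or reset signal."""
        self.count = 0

    @property
    def full(self) -> bool:
        """Full signal: asserted once the threshold is reached."""
        return self.count >= self.threshold
```

One such object per thread would supply the full input that the state machine consults before selecting that thread.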

Figure 10 is a state diagram illustrating operation of the thread 
selection state machine 160, illustrated in Figure 9. When in multithreading 
mode, the state machine attempts to time-multiplex multiple threads on a 
cycle-by-cycle basis. When a thread encounters a relatively long stall, the 
state machine 160 attempts to provide full bandwidth to the thread that has 
not stalled. When multiple threads (e.g., thread 0 and thread 1) experience 
long latency stalls, the state machine 160 may, in certain circumstances, 
require a one-cycle bubble (e.g., if both threads are stalled and the state 
machine 160 is in "thread 0" state and a "thread 1" stall is removed). 
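
The behavior described above can be approximated by a toy model in Python. The transitions below are assumptions chosen to match the prose (alternation when both threads can run, full bandwidth to the thread that has not stalled, and a one-cycle bubble when the machine must switch away from a stalled thread); they are not a transcription of Figure 10, and the function name `next_cycle` is hypothetical.

```python
from typing import Optional, Tuple


def next_cycle(state: int, stalled: Tuple[bool, bool]) -> Tuple[int, Optional[int]]:
    """Return (next_state, thread_served_this_cycle).

    state   -- thread (0 or 1) that the state machine currently points at
    stalled -- stall flags for thread 0 and thread 1
    A served value of None models a bubble (no thread is served).
    """
    other = 1 - state
    if not stalled[state]:
        # Serve the current thread; alternate only if the other thread can run.
        return (other if not stalled[other] else state), state
    if not stalled[other]:
        # Current thread is stalled, other is ready: switch, losing one cycle.
        return other, None
    # Both threads are stalled: nothing to serve, stay in the current state.
    return state, None
```

In this model, a machine resting in the "thread 0" state with both threads stalled spends one cycle switching once the "thread 1" stall is removed, which corresponds to the one-cycle bubble mentioned above.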

Referring back to Figure 6, it will be noted that the selection signal 
141, outputted from the thread selection logic 140, is not itself regarded as a 
"valid bit", but is rather used as a 2-1 MUX selection control to the MUX 142. 
The MUX 142 operates to select between control signals outputted from a 
first thread control 144 and a second thread control 146. The outputs of the 
controls 144 and 146 are dependent upon valid bits being set for the relevant 
threads. For example, the selection signal 141 may indicate a thread entry 
for a particular thread (e.g., thread 0) to be outputted from the trace cache 
62. However, the valid bit for the relevant entry may be set to 0, indicating 
an invalid entry. 

Victim Selection Logic 

The partitioning of the trace cache 62, as illustrated in Figure 6, may, 
in one embodiment, be implemented by the victim selection logic 154. The 
victim selection logic 154 is responsible for identifying the way (in both the 
tag array 126 and the data array 128) to which a microinstruction is written. 

Figure 11 is a block diagram illustrating architectural details of one 
embodiment of the victim selection logic 154. The victim selection logic 154 
is shown to include minitag victim selection logic 180, valid victim selection 
logic 182 and Least Recently Used (LRU) victim selection logic 184. A 
priority multiplexing operation is performed on the outputs of the selection 
logics 180, 182 and 184 by a priority MUX 186. The priority ordering 
implemented by the priority MUX 186 is as follows: 

1. Minitag victim; 

2. Valid victim; and 

3. LRU victim. 

A multi-threaded latch structure 190 is used to pass the results of the 
priority MUX to the trace cache 62. 
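
The priority ordering listed above can be expressed as a small Python function. This is a sketch only: the function name `priority_mux` and the convention of representing "no victim found" as `None` are assumptions for illustration.

```python
from typing import Optional


def priority_mux(minitag_victim: Optional[int],
                 valid_victim: Optional[int],
                 lru_victim: Optional[int]) -> Optional[int]:
    """Return the highest-priority victim way, or None if no candidate exists.

    Each argument is a way index, or None when the corresponding selection
    logic did not identify a victim.
    """
    for candidate in (minitag_victim, valid_victim, lru_victim):
        # Compare against None explicitly so that way 0 is a legal victim.
        if candidate is not None:
            return candidate
    return None
```

For example, `priority_mux(None, 2, 7)` yields way 2, because no minitag victim exists and a valid victim outranks an LRU victim.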

Figure 12 is a flow chart illustrating an exemplary method 200, 
according to one embodiment, of partitioning a memory resource, such as, 
for example, the trace cache 62, within a multi-threaded processor. The 
operation of the various units of the victim selection logic 154 illustrated in 
Figure 11 will be described with reference to the flow chart shown in Figure 
12. 

The method 200 commences at block 202 where the minitag victim 
selection logic 180 performs a minitag victim determination with respect to 
the minitag array 127. Specifically, the logic 180 attempts to identify a 
conflict between an existing valid minitag array entry and a current 
instruction pointer (e.g., the current Linear Instruction Pointer (CLIP)). 

At decision box 204, a determination is made as to whether a minitag 
victim was located at block 202. If so, the method 200 advances to block 212, 
where relevant trace cache data (e.g., a microinstruction) is written to the 
identified victim entry within the trace cache 62. As a minitag hit is 
regarded as being "mutually exclusive", an identified minitag victim is given 
the highest priority by the victim selection logic 154. 

Following a negative determination at decision box 204, at block 206, 
a valid victim determination operation is performed by the valid victim 
selection logic 182. This operation involves simply identifying an invalid 
entry within the trace cache 62 by examining valid bits 155 stored within the 
tag array 126 of the trace cache 62. Following a positive determination at 
decision box 208, the method 200 advances to block 212. On the other hand, 
following a negative determination (i.e., no invalid entries are identified) at 
decision box 208, the method 200 proceeds to block 210, where an LRU victim 
determination operation is performed. Following completion of the 
operation at block 210, the method 200 again advances to block 212. The 
method 200 then terminates at step 214. 
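
The valid victim determination of block 206 amounts to a scan of the valid bits for one set, which may be sketched in Python as follows. The function name is hypothetical, and two points are assumptions not stated in the description: the lowest-numbered invalid way is returned, and thread partitioning is ignored at this step.

```python
from typing import Optional, Sequence


def find_valid_victim(valid_bits: Sequence[bool]) -> Optional[int]:
    """Return the lowest-numbered way whose valid bit is clear, or None.

    valid_bits holds one valid bit per way of the selected set.
    A return value of None corresponds to a negative determination
    at decision box 208.
    """
    for way, valid in enumerate(valid_bits):
        if not valid:
            return way
    return None
```

When the function returns `None`, the flow falls through to the LRU victim determination at block 210.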

Figure 13 is a flow chart illustrating an exemplary method 210, 
according to one embodiment, of partitioning a resource, in the exemplary 
form of a memory resource, utilizing an LRU history associated with the 
relevant memory resource. 

Figure 14 is a block diagram illustrating an exemplary LRU history 
240 that may be utilized in the performance of the method 210, the execution 
of which will be described with reference to Figure 13. 

The method 210 commences at block 222 with the receipt of a 
microinstruction, and associated tag information, at the victim selection 
logic 154. 

At block 224, a set into which the microinstruction may potentially be 
written is identified (e.g., by a write pointer). 

At block 226, having identified a victim set, the LRU victim selection 
logic 184 examines the LRU history for the relevant set. Figure 14 illustrates 
the LRU history 240, as maintained within the tag array 126 of the trace 
cache 62, the LRU history 240 containing an LRU history for each set within 
the data array 128. 

At decision box 228, the LRU victim selection logic 184 determines 
whether the tail entry, indicating a specific way within the set, is available to 
a relevant thread (e.g., thread 0 or thread 1). As mentioned above, in an 
exemplary embodiment, ways 0 and 1 may be available exclusively to a first 
thread (e.g., thread 0), ways 6 and 7 may be available exclusively to a second 
thread (e.g., thread 1) and ways 2-5 may be dynamically shared between 
multiple threads. Referencing the exemplary LRU history for a set N, way 6 is 
indicated by the tail entry as being the least recently used way in the 
relevant set N. Assume, for example, that the microinstruction to be cached 
belongs to a first thread (e.g., thread 0), in which case way 6 would not be 
available to receive the microinstruction on account of way 6 having been 
dedicated exclusively to the storage of microinstructions for a second thread 
(e.g., thread 1). 

Returning to Figure 13, following a negative determination at 
decision box 228, the LRU victim selection logic 184 proceeds to examine 
entries within the LRU history 252 for the relevant set behind the tail entry 
to identify a way that may receive the microinstruction for the relevant 
thread. As indicated at block 230, the LRU victim selection logic 184 
examines a predetermined number M of tail entries (e.g., the three entries 
closest to the tail of the LRU history 252 for the set) to locate a way, closest 
to the tail of the LRU history, that is available to the relevant thread. 

In the example provided in Figure 14, the next-to-last entry within the 
LRU history for the relevant set identifies way 3 which, under the scheme 
described above, would be available to receive a microinstruction for a first 
thread (e.g., thread 0) as way 3 is located in the "shared" portion of the trace 
cache 62. 

Figure 14 illustrates how the entry for way 3, within the LRU history 
252 for the relevant set, is moved to the head of the LRU history 252 on 
account of this way being designated for storage of the relevant 
microinstruction. 

Returning to the flow chart in Figure 13, at block 232, the victim entry 
(i.e., the victim way) within the relevant set that is available to the relevant 
thread is identified, and the microinstruction is written to that way within the 
set. The method 210 then ends at step 234. 
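
The LRU victim determination walked through above can be sketched in Python. This is a software illustration under stated assumptions rather than the described hardware: the names `DEDICATED`, `SHARED`, `available` and `select_lru_victim` are hypothetical, the history is modeled as a list ordered from head (most recently used) to tail (least recently used), and only the two entries nearest the tail of the sample history (way 3, then way 6) are taken from the Figure 14 example; the remaining entries are invented.

```python
from typing import List, Optional

# Exemplary way assignment from the description above.
DEDICATED = {0: {0, 1}, 1: {6, 7}}   # thread -> ways reserved for that thread
SHARED = {2, 3, 4, 5}                # ways shared between both threads


def available(way: int, thread: int) -> bool:
    """True if the way may receive a microinstruction of the given thread."""
    return way in DEDICATED[thread] or way in SHARED


def select_lru_victim(history: List[int], thread: int, m: int = 3) -> Optional[int]:
    """Pick the victim way for a thread and update the LRU history in place.

    history -- way indices ordered head (most recent) to tail (least recent)
    m       -- number of tail entries examined, tail first
    Returns None if none of the m examined entries is available to the thread.
    """
    for way in reversed(history[-m:]):
        if available(way, thread):
            # The chosen way becomes the most recently used entry.
            history.remove(way)
            history.insert(0, way)
            return way
    return None


history = [0, 1, 2, 4, 5, 7, 3, 6]       # tail entry is way 6
victim = select_lru_victim(history, thread=0)
```

Here the tail entry (way 6) is skipped because it is dedicated to thread 1, so way 3 is chosen and moved to the head of the history. In this exemplary configuration only two ways are unavailable to either thread, so examining three tail entries always yields a candidate.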

Figure 15 is a block diagram illustrating further details regarding the 
inputs to, and output from, the LRU victim selection logic 184. The LRU victim 
selection logic 184, in one embodiment, comprises discrete logic components 
that implement the methodology described above. In an alternative 
embodiment, the victim selection logic 184 may execute code to implement 
the described methodology. Specifically, the logic 184 is shown to receive a 
7-bit pending multi-thread (PENDING_MT) signal 250, a 28-bit least recently 
used (LRU) signal 252, a second thread status (MT1) signal 254 and a first 
thread status (MT0) signal 256 as inputs. The signal 250 indicates the way 
selected to receive a microinstruction of a current thread or further thread 
(other than a thread currently being considered), as 
indicated by the selection logic 184 during a previous victim selection 
operation, or as determined by further victim selection logic 154 associated 
with the further thread. The signal 250 is utilized by the LRU victim 
selection logic 184 to ensure that the selection logic 184 does not "doubly 
select" the same way between two threads, or that multiple LRU victim 
selection logics 184 do not select the same way between two threads. To this 
end, the victim selection logic 184 implements discrete logic that prevents it 
from selecting the same way as indicated by the signal 250. 

The signal 250 accordingly, in one embodiment, indicates the way 
that was previously selected as a victim, while the LRU signal 252 provides 
the LRU history 252 for the relevant set to the logic 184. The status signals 
254 and 256 indicate to the logic 184 which of the threads are "alive" or 
executing within a processor 30. The logic 184 then outputs a 7-bit selection 
signal 260 for a relevant set, indicating the way within a relevant set to 
which the microinstruction should be written for caching purposes within 
the trace cache 62. 
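
The purpose of the pending signal, namely preventing the same way from being selected twice, can be sketched in Python. The model is an assumption-laden simplification: ways are represented as integer indices rather than as bits of a signal, and the function name `select_excluding_pending` is hypothetical.

```python
from typing import Iterable, Optional


def select_excluding_pending(candidates: Iterable[int],
                             pending_way: Optional[int]) -> Optional[int]:
    """Return the first candidate way that differs from the pending way.

    candidates  -- ways available to the thread, least recently used first
    pending_way -- way already selected as a victim by a previous or parallel
                   selection (None if no selection is pending)
    """
    for way in candidates:
        if way != pending_way:
            return way
    return None
```

For instance, if way 3 is already pending, a candidate list of ways 3, 5 and 0 yields way 5, so that two selections never target the same way.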

By implementing a pseudo-dynamic partitioning of a resource, such 
as the trace cache 62, the present invention ensures that a certain 
predetermined minimum threshold of the capacity of a resource is always 
reserved and available for a particular thread within a multithreaded 
processor. Nonetheless, by defining a "shared" portion that is accessible to 
both threads, the present invention facilitates dynamic redistribution of a 
resource's capacity between multiple threads according to the requirements 
of such threads. 

Further, the LRU victim selection methodology discussed above 
enables hits to occur on ways allocated to a further thread, but simply 
disallows the validation of such a hit, and forces the LRU victim selection 
algorithm to select a further way, according to an LRU history, that is 
available to a particular thread. 

As mentioned above, the logic for implementing any one of the 
methodologies discussed above may be implemented as discrete logic 
within a functional unit, or may comprise a sequence of instructions (e.g., 
code) that is executed within the processor to implement the method. The 
sequence of instructions, it will be appreciated, may be stored on any 
medium from which it is retrievable for execution. Examples of such 
media include a removable storage medium (e.g., a diskette, CD-ROM) or 
a memory resource associated with, or included within, a processor (e.g., 
Random Access Memory (RAM), cache memories or the like). Accordingly, 
any such medium should be regarded as comprising a "computer-readable" 
medium and may be included in a processor, or accessible by a processor 
employed within a computer system. 

Thus, a method and apparatus for partitioning a processor resource 
within a multi-threaded processor have been described. Although the 
present invention has been described with reference to specific exemplary 
embodiments, it will be evident that various modifications and changes may 
be made to these embodiments without departing from the broader spirit 
and scope of the invention. Accordingly, the specification and drawings are to 
be regarded in an illustrative rather than a restrictive sense. 

WHAT IS CLAIMED: 

1. A method including: 

dedicating a first portion of a resource exclusively to a first thread; 

dedicating a second portion of the resource exclusively to a second 
thread; and 

dynamically sharing a third portion of the resource between the first 
and second threads. 

2. The method of claim 1 wherein the dynamic sharing of the third 
portion of the resource is performed according to resource demands of the 
respective first and second threads. 

3. The method of claim 1 wherein the resource comprises a memory 
resource including first and second portions dedicated to the first and 
second threads respectively and a third portion shared between the first and 
second threads, the method including: 

identifying a first location within the memory resource as a candidate 
location to receive an information item associated with the first 
thread; 

determining whether the candidate location is within the first or the 
third portion of the memory resource dedicated to the first thread; 

if the candidate location is within the first or the third portion of the 
memory resource, then storing the information associated with the first 
thread at the candidate location; and 

if the candidate location is within the second portion of the memory 
resource then identifying a further location as being the candidate 
location. 

4. The method of claim 3 wherein the memory resource comprises an N 
way set associative memory and wherein the first portion comprises a first 
way dedicated to the first thread, the second portion comprises a second 
way dedicated to the second thread and the third portion comprises a third 
way shared between the first and second threads, wherein the identification 
of the first location as the candidate location comprises identifying a selected 
way within a selected set of the memory as a candidate way to receive the 
information item associated with the first thread. 

5. The method of claim 4 wherein the identification of the further 
location as the candidate location comprises identifying a further way within 
the selected set of the memory as the candidate way to receive the 
information item associated with the first thread. 

6. The method of claim 4 wherein the identification of the selected way 
within the selected set as the candidate way comprises identifying a way 
within the selected set that was least recently used. 

7. The method of claim 5 wherein the identification of the further way 
within the selected set as a candidate way comprises identifying a way within 
the selected set that was second-least recently used. 

8. The method of claim 6 including examining a Least Recently Used 
(LRU) history for the selected set to identify the way that was least recently 
used. 

9. The method of claim 8 including examining a set of entries within the 
LRU history for the selected set, each entry within the set of entries 
indicating a respective way within the selected set, wherein the set of entries 
is ordered in a sequence determined by least recent usage of a respective 
way and the selection of the candidate way comprises performing a 
sequential examination of the entries of the set of entries to locate a least 
recently used way that comprises either the first or the second way. 

10. The method of claim 4 wherein the memory comprises a trace cache 
memory, and wherein the information item associated with the first thread 
comprises a microinstruction of the first thread. 



11. A resource comprising: 

a first portion dedicated for utilization by a first thread executing 
within a multi-threaded processor; 

a second portion dedicated to utilization by a second thread 
executing within the multi-threaded processor; and 

a third portion shared by the first and second threads. 

12. The resource of claim 11 wherein the resource comprises a memory 
including selection logic to identify 
a first location within the memory resource as a candidate location to receive 
an information item associated with the first thread, to determine whether 
the candidate location is within the first or third portions of the memory 
resource, then to store the information associated with the first thread at the 
candidate location but, if the candidate location is within the second portion of 
the memory resource, then to identify a further location as being the 
candidate location. 

13. The resource of claim 12 comprising an N way set associative memory 
and wherein the first portion comprises a first way dedicated to the first 
thread, the second portion comprises a second way dedicated to the second 
thread and the third portion comprises a third way shared between the first 
and second threads. 

14. The resource of claim 12 wherein the selection logic identifies a 
selected way within a selected set of the memory as a candidate way to 
receive the information item associated with the first thread if the selected 
way comprises either the first or the third way. 

15. The resource of claim 12 wherein the selection logic identifies a 
further way within the selected set of the memory as the candidate way to 
receive the information item associated with the first thread if the selected 
way comprises the second way. 

16. The resource of claim 14 wherein the selection logic identifies the 
selected way within the selected set as the candidate way by identifying the 
selected way within the selected set as a least recently used way within the 
selected set. 

17. The resource of claim 15 wherein the selection logic identifies the 
further way within the selected set as a candidate way by identifying the 
further way within the selected set as a second-least recently used way 
within the selected set. 

18. The resource of claim 16 wherein the selection logic examines a Least 
Recently Used (LRU) history for the selected set to identify the way that was 
least recently used. 

19. The resource of claim 18 wherein the selection logic examines a set of 
entries within the LRU history for the selected set, each entry within the set 
of entries indicating a respective way within the selected set, wherein the set 
of entries is ordered in a sequence determined by least recent usage of a 
respective way and the selection of the candidate way comprises performing 
a sequential examination of the entries of the set of entries to locate a least 
recently used way that comprises either the first or the second way. 

20. The resource of claim 18 wherein the memory comprises a trace 
cache memory, and wherein the information item associated with the first 
thread comprises a microinstruction of the first thread. 

21. Selection logic including: 

first means for identifying a first location within a memory resource, 
associated with a multi-threaded processor, as a candidate location to 
receive an information item associated with a first thread; 

second means for determining whether the candidate location is 
within a second portion of the memory resource dedicated to the 
second thread; 

wherein, if the candidate location is within the second portion of the 
memory resource dedicated to the second thread, the first means identifies a 
further location within the memory resource as the candidate location. 

22. The selection logic of claim 21 wherein the memory resource 
comprises an N way set associative memory and wherein the first portion 
comprises a first way dedicated to the first thread, the second portion 
comprises a second way dedicated to the second thread and the third 
portion comprises a third way shared between the first and second threads, 
and wherein the first means identifies a selected way within a selected set of 
the memory as a candidate way to receive information not associated with 
the first way. 

23. The selection logic of claim 22 wherein the first means identifies a 
further way within the selected set of the memory as the candidate way to 
receive information associated with the first thread. 



24. A method including: 

defining a memory resource, associated with a multi-threaded 
processor, to include first and second portions dedicated to the first 
and second threads respectively and a third portion shared between 
the first and second threads; 

for an information item associated with the first thread, examining a 
history of least recently used portions to identify either the first or the 
third portion as being a least recently used portion available to the 
first thread; and 

storing the information item within the least recently used portion. 

25. The method of claim 24 wherein, for the information item associated 
with the first thread, the second portion is excluded from the identification 
as the least recently used portion on account of being dedicated to the second 
thread. 

26. The method of claim 24, wherein the memory resource comprises an N 
way set associative cache memory and wherein the first, second and third 
portions comprise respective first, second and third ways. 

27. The method of claim 26 wherein the examination of the history of 
least recently used portions includes examining a least recently used history 
for a selected set of the set associative cache memory. 

28. The method of claim 24 wherein the cache memory comprises a trace 
cache memory, and wherein the information item associated with the first 
thread comprises a microinstruction of the first thread. 

29. A computer-readable medium storing a sequence of instructions that, 
when executed within a processor, causes the processor to perform the steps 
of: 

dedicating a first portion of a resource exclusively to a first thread; 

dedicating a second portion of the resource exclusively to a second 
thread; and 

dynamically sharing a third portion of the resource between the first 
and second threads. 

30. The computer-readable medium of claim 29 wherein the dynamic 
sharing of the third portion of the resource is performed according to 
resource demands of the respective first and second threads. 

ABSTRACT OF THE DISCLOSURE 

A method of partitioning a memory resource, associated with a 
multi-threaded processor, includes defining the memory resource to include 
first and second portions that are dedicated to the first and second threads 
respectively. A third portion of the memory resource is then designated as 
being shared between the first and second threads. Upon receipt of an 
information item (e.g., a microinstruction associated with the first thread 
and to be stored in the memory resource), a history of Least Recently Used 
(LRU) portions is examined to identify a location in either the first or the 
third portion, but not the second portion, as being a least recently used 
portion. The second portion is excluded from this examination on account 
of being dedicated to the second thread. The information item is then stored 
within a location, within either the first or the third portion, identified as 
having been least recently used. 




[Drawing residue: pipeline stage labels ALLOCATE, EXECUTE and RETIRE; figure number not recoverable.] 



[FIG. 2: block diagram of the processor 30. Recoverable labels: bus interface unit (FSB logic, bus queue); memory execution unit (unified cache (level 1), data TLB, memory ordering); microinstruction translation engine; trace delivery engine (TDE) with trace cache, trace BTB, micro-code sequencer and MS uop queue; in-order front end; scheduler (reservation stations), renamer and allocator; execution unit (floating point/MMX execution engine, integer execution engine, data cache (level 0)); reorder buffer (checker) and replay queue; out-of-order back end.] 



[FIG. 3: block diagram showing functional units (instruction streaming buffer, decoder, trace cache, renamer/allocator, scheduler, execution unit, retirement logic), with portions marked THREAD 0, THREAD 1 and SHARED.] 





[FIG. 8: block diagram of the inputs and outputs of the thread selection logic 140. Recoverable labels: MS (control and next UIP for thread 0 and thread 1); TBPU; TC (control and next IP for thread 0 and thread 1); MS uop queue; TC engine (MITE interface); TC/MS control logic.] 






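FIG. 12 

[Flowchart not reproduced. Legible elements: decision VALID VICTIM LOCATED? with YES and NO 
branches; LRU VICTIM DETERMINATION; WRITE TRACE CACHE DATA TO VICTIM ENTRY OF TRACE CACHE. 
Legible reference numerals: 208, 210, 212, 214.] 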




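FIG. 13 

[Flowchart not reproduced. Legible steps, in the order extracted: RECEIVE INSTRUCTION INFORMATION 
FOR THREAD N TO BE CACHED; LOCATE SET; EXAMINE LRU HISTORY FOR SET; EXAMINE M TAIL ENTRIES TO 
LOCATE ENTRY AVAILABLE TO THREAD N; IDENTIFY VICTIM ENTRY IN SET AVAILABLE TO THREAD N. A YES 
branch label is also legible. Legible reference numerals: 222, 224, 226.] 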
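
The steps legible in FIG. 13 suggest a victim search that is restricted to the least recently used 
end of a set. The following is a minimal Python sketch under stated assumptions, not the claimed 
implementation: the names select_victim, lru_order, available_ways and m_tail are hypothetical, the 
LRU history is modeled as a list ordered from most to least recently used, and "available to 
thread N" is modeled as a plain set of way indices. None of these modeling choices is taken from 
the specification. 

```python
from typing import Optional, Sequence, Set


def select_victim(lru_order: Sequence[int],
                  available_ways: Set[int],
                  m_tail: int) -> Optional[int]:
    """Pick a victim way for one thread from the LRU tail of a set.

    lru_order      -- way indices of the set, most recently used first
                      (hypothetical model of the LRU history)
    available_ways -- ways the requesting thread may evict (assumption)
    m_tail         -- how many entries at the LRU end are examined (the "M")
    """
    # Clamp M so that 0 examines nothing and large values examine the whole set.
    m = max(0, min(m_tail, len(lru_order)))
    tail = lru_order[len(lru_order) - m:]
    # Walk from the least recently used entry toward more recent ones.
    for way in reversed(tail):
        if way in available_ways:
            return way
    return None  # no entry in the examined tail is available to this thread


# Example: an 8-way set in which the requesting thread may use ways 0, 1 and 5.
order = [3, 1, 0, 2, 5, 7, 4, 6]
print(select_victim(order, {0, 1, 5}, m_tail=4))  # prints 5
print(select_victim(order, {0, 1, 5}, m_tail=2))  # prints None
```

In this sketch a larger M lets a thread reach further into the LRU history before giving up, 
while a smaller M confines it to the oldest entries; how a real design would handle the case in 
which no entry is found is left open here. 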



FIG. 15 

[Drawing not reproduced. Legible signal and block labels: PENDING_MT<7:0>; LRU<27:0>; MT1; MT0; 
LRU VICTIM SELECTION LOGIC; W0-W7. Legible reference numerals: 184, 232, 250, 260.] 



Attorney's Docket No.: 042390.P4740 Patent 
DECLARATION AND POWER OF ATTORNEY FOR PATENT APPLICATION 



As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below, next to my name. 

I believe I am the original, first, and sole inventor (if only one name is listed below) or an original, 
first, and joint inventor (if plural names are listed below) of the subject matter which is claimed and 
for which a patent is sought on the invention entitled 

METHOD AND APPARATUS FOR PARTITIONING A RESOURCE BETWEEN 

MULTIPLE THREADS WITHIN A MULTI-THREADED PROCESSOR 

the specification of which 

 X  is attached hereto. 
___ was filed on ____________ as United States Application Number ____________ 
    or PCT International Application Number ____________ 
    and was amended on ____________ (if applicable). 

I hereby state that I have reviewed and understand the contents of the above-identified 
specification, including the claim(s), as amended by any amendment referred to above. I do not 
know and do not believe that the claimed invention was ever known or used in the United States of 
America before my invention thereof, or patented or described in any printed publication in any 
country before my invention thereof or more than one year prior to this application, that the same 
was not in public use or on sale in the United States of America more than one year prior to this 
application, and that the invention has not been patented or made the subject of an inventor's 
certificate issued before the date of this application in any country foreign to the United States of 
America on an application filed by me or my legal representatives or assigns more than twelve 
months (for a utility patent application) or six months (for a design patent application) prior to this 
application. 

I acknowledge the duty to disclose all information known to me to be material to patentability as 
defined in Title 37, Code of Federal Regulations, Section 1.56. 

I hereby claim foreign priority benefits under Title 35, United States Code, Section 119(a)-(d), of any 
foreign application(s) for patent or inventor's certificate listed below and have also identified below 
any foreign application for patent or inventor's certificate having a filing date before that of the 
application on which priority is claimed: 

Prior Foreign Application(s) 



(Number) (Country) (Day/Month/Year Filed) Priority Claimed: Yes No 

(Number) (Country) (Day/Month/Year Filed) Priority Claimed: Yes No 

(Number) (Country) (Day/Month/Year Filed) Priority Claimed: Yes No 

I hereby claim the benefit under title 35, United States Code, Section 119(e) of any United States 
provisional application(s) listed below: 



(Application Number) Filing Date 



(Application Number) Filing Date 



I hereby claim the benefit under Title 35, United States Code, Section 120 of any United States 
application(s) listed below and, insofar as the subject matter of each of the claims of this application 
is not disclosed in the prior United States application in the manner provided by the first paragraph 
of Title 35, United States Code, Section 112, I acknowledge the duty to disclose all information 
known to me to be material to patentability as defined in Title 37, Code of Federal Regulations, 
Section 1.56 which became available between the filing date of the prior application and the national 
or PCT international filing date of this application: 



(Application Number) Filing Date (Status: patented, pending, abandoned) 



(Application Number) Filing Date (Status: patented, pending, abandoned) 

I hereby appoint the persons listed on Appendix A hereto (which is incorporated by reference and a 
part of this document) as my respective patent attorneys and patent agents, with full power of 
substitution and revocation, to prosecute this application and to transact all business in the Patent 
and Trademark Office connected herewith. 

Send correspondence to Andre L. Marais (Name of Attorney or Agent), BLAKELY, SOKOLOFF, TAYLOR & 
ZAFMAN LLP, 12400 Wilshire Boulevard 7th Floor, Los Angeles, California 90025 and direct 
telephone calls to Andre L. Marais (Name of Attorney or Agent), (408) 720-8300. 

I hereby declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false statements and the like so made 
are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United 
States Code and that such willful false statements may jeopardize the validity of the 
application or any patent issued thereon. 

Full Name of Sole/First Inventor Chan W. Lee 

Inventor's Signature ____________ Date ____________ 

Residence Portland, Oregon Citizenship South Korea 
(City, State) (Country) 

Post Office Address 1730 SW Harbor Way #605 
Portland, OR 97201 



Full Name of Second/Joint Inventor Glenn Hinton 

Inventor's Signature ____________ Date ____________ 

Residence Portland, Oregon Citizenship USA 
(City, State) (Country) 

Post Office Address 6130 NW 185th Ave. 
Portland, OR 97229 



Full Name of Third/Joint Inventor Robert Krick 

Inventor's Signature ____________ Date ____________ 

Residence Fort Collins, Colorado Citizenship USA 
(City, State) (Country) 

Post Office Address 3513 Red Mountain Drive 
Fort Collins, CO 80525 



Full Name of Fourth/Joint Inventor ____________ 

Inventor's Signature ____________ Date ____________ 

Residence ____________ Citizenship ____________ 
(City, State) (Country) 

Post Office Address ____________ 

APPENDIX A 



William E. Alford, Reg. No. 37,764; Farzad E. Amini, Reg. No. P42,261; Aloysius T. C. AuYeung, Reg. No. 
35,432; William Thomas Babbitt, Reg. No. 39,591; Carol F. Barry, Reg. No. 41,600; Jordan Michael 
Becker, Reg. No. 39,602; Bradley J. Bereznak, Reg. No. 33,474; Michael A. Bernadicou, Reg. No. 35,934; 
Roger W. Blakely, Jr., Reg. No. 25,831; Gregory D. Caldwell, Reg. No. 39,926; Ronald C. Card, Reg. No. 
44,587; Andrew C. Chen, Reg. No. 43,544; Thomas M. Coester, Reg. No. 39,637; Alin Corie, Reg. No. 
P46,244; Dennis M. deGuzman, Reg. No. 41,702; Stephen M. De Klerk, under 37 C.F.R. § 10.9(b); 
Michael Anthony DeSanctis, Reg. No. 39,957; Daniel M. De Vos, Reg. No. 37,813; Robert Andrew Diehl, 
Reg. No. 40,992; Sanjeet Dutta, Reg. No. P46,145; Matthew C. Fagan, Reg. No. 37,542; Tarek N. Fahmi, 
Reg. No. 41,402; Paramita Ghosh, Reg. No. 42,806; James Y. Go, Reg. No. 40,621; James A. Henry, 
Reg. No. 41,064; Willmore F. Holbrow III, Reg. No. P41,845; Sheryl Sue Holloway, Reg. No. 37,850; 
George W Hoover II, Reg. No. 32,992; Eric S. Hyman, Reg. No. 30,139; William W. Kidd, Reg. No. 
31,772; Sang Hui Kim, Reg. No. 40,450; Eric T. King, Reg. No. 44,188; Erica W. Kuo, Reg. No. 42,775; 
Kurt P. Leyendecker, Reg. No. 42,799; Michael J. Mallie, Reg. No. 36,591; Andre L. Marais, under 37 
C.F.R. § 10.9(b); Paul A. Mendonsa, Reg. No. 42,879; Darren J. Milliken, Reg. 42,004; Lisa A. Norris, 
Reg. No. 44,976; Chun M. Ng, Reg. No. 36,878; Thien T. Nguyen, Reg. No. 43,835; Thinh V. Nguyen, 
Reg. No. 42,034; Dennis A. Nicholls, Reg. No. 42,036; Daniel E. Ovanezian, Reg. No. 41,236; Marina 
Portnova, Reg. No. P45,750; Babak Redjaian, Reg. No. 42,096; William F. Ryann, Reg. 44,313; James 
H Salter, Reg. No. 35,668; William W. Schaal, Reg. No. 39,018; James C. Scheller, Reg. No. 31,195; 
Jeffrey Sam Smith, Reg. No. 39,377; Maria McCormack Sobrino, Reg. No. 31,639; Stanley W. Sokoloff, 
Reg. No. 25,128; Judith A. Szepesi, Reg. No. 39,393; Vincent P. Tassinari, Reg. No. 42,179; Edwin H. 
Taylor, Reg. No. 25,129; John F. Travis, Reg. No. 43,203; George G. C. Tseng, Reg. No. 41,355; Joseph 
A Twarowski, Reg. No. 42,191; Lester J. Vincent, Reg. No. 31,460; Glenn E. Von Tersch, Reg. No. 
41,364; John Patrick Ward, Reg. No. 40,216; Mark L. Watson, Reg. No. P46,322; Thomas C. Webster, 
Reg. No. P46,154; Charles T. J. Weigell, Reg. No. 43,398; Kirk D. Williams, Reg. No. 42,229; James M. 
Wu, Reg. No. 45,241; Steven D. Yates, Reg. No. 42,242; and Norman Zafman, Reg. No. 26,250; my 
patent attorneys, and Justin M. Dillon, Reg. No. 42,486; my patent agent, of BLAKELY, SOKOLOFF, 
TAYLOR & ZAFMAN LLP, with offices located at 12400 Wilshire Boulevard, 7th Floor, Los Angeles, 
California 90025, telephone (310) 207-3800, and James R. Thein, Reg. No. 31,710, my patent attorney. 

APPENDIX B 



Title 37, Code of Federal Regulations, Section 1.56 
Duty to Disclose Information Material to Patentability 

(a) A patent by its very nature is affected with a public interest. The public interest is best served, 
and the most effective patent examination occurs when, at the time an application is being examined, the 
Office is aware of and evaluates the teachings of all information material to patentability. Each individual 
associated with the filing and prosecution of a patent application has a duty of candor and good faith in 
dealing with the Office, which includes a duty to disclose to the Office all information known to that individual 
to be material to patentability as defined in this section. The duty to disclose information exists with respect 
to each pending claim until the claim is cancelled or withdrawn from consideration, or the application becomes 
abandoned. Information material to the patentability of a claim that is cancelled or withdrawn from 
consideration need not be submitted if the information is not material to the patentability of any claim 
remaining under consideration in the application. There is no duty to submit information which is not material 
to the patentability of any existing claim. The duty to disclose all information known to be material to 
patentability is deemed to be satisfied if all information known to be material to patentability of any claim 
issued in a patent was cited by the Office or submitted to the Office in the manner prescribed by §§1.97(b)-(d) 
and 1.98. However, no patent will be granted on an application in connection with which fraud on the Office 
was practiced or attempted or the duty of disclosure was violated through bad faith or intentional misconduct. 
The Office encourages applicants to carefully examine: 

(1) Prior art cited in search reports of a foreign patent office in a counterpart application, and 

(2) The closest information over which individuals associated with the filing or prosecution of a 
patent application believe any pending claim patentably defines, to make sure that any material information 
contained therein is disclosed to the Office. 

(b) Under this section, information is material to patentability when it is not cumulative to 
information already of record or being made of record in the application, and 

(1) It establishes, by itself or in combination with other information, a prima facie case of 
unpatentability of a claim; or 

(2) It refutes, or is inconsistent with, a position the applicant takes in: 

(i) Opposing an argument of unpatentability relied on by the Office, or 

(ii) Asserting an argument of patentability. 

A prima facie case of unpatentability is established when the information compels a conclusion that a claim is 
unpatentable under the preponderance of evidence, burden-of-proof standard, giving each term in the claim 
its broadest reasonable construction consistent with the specification, and before any consideration is given to 
evidence which may be submitted in an attempt to establish a contrary conclusion of patentability. 

(c) Individuals associated with the filing or prosecution of a patent application within the 
meaning of this section are: 

(1) Each inventor named in the application; 

(2) Each attorney or agent who prepares or prosecutes the application; and 

(3) Every other person who is substantively involved in the preparation or prosecution of the 
application and who is associated with the inventor, with the assignee or with anyone to whom there is an 
obligation to assign the application. 

(d) Individuals other than the attorney, agent or inventor may comply with this section by 
disclosing information to the attorney, agent, or inventor. 

042390.P4740 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



In re Application of: 



Chan W. Lee, et al. 



Examiner: Not yet assigned 



Serial No.: New application 



Art Unit: Not yet assigned 



Filing Date: Herewith 

For: METHOD AND APPARATUS FOR 

PARTITIONING A RESOURCE BETWEEN 
MULTIPLE THREADS WITHIN A 
MULTI-THREADED PROCESSOR 

Assistant Commissioner for Patents 
Washington, D.C. 20231 

APPOINTMENT OF ASSOCIATE ATTORNEY 

Sir: 

I hereby appoint Andre L. Marais as my associate attorney in the above-entitled 
application, to prosecute this application, to make alterations and amendments therein, 
and to transact all business in the Patent and Trademark Office connected therewith. 

Please continue to address all future communications to Blakely, Sokoloff, Taylor 
& Zafman LLP, 12400 Wilshire Blvd., Seventh Floor, Los Angeles, CA 90025-1026. 

Respectfully submitted, 

BLAKELY, SOKOLOFF, TAYLOR & ZAFMAN, LLP 



12400 Wilshire Boulevard 
Seventh Floor 

Los Angeles, CA 90025-1026 
(408) 720-8598 





Jordan M. Becker 
Registration No. 39,602 



